Distributed Shared Memory System with Fault Detection Support
Abstract
A distributed shared memory (DSM) system brings the simplicity of shared-memory programming to a cluster of computers. Compared with the hardware approach, software DSM is a relatively low-cost, low-risk entry into the concurrent computing arena. However, fault detection becomes essential when software DSM is used to execute long-running applications. When one of the remote processors crashes or the process running on it fails, the system should let the application become aware of the failure and take appropriate action, such as cleaning up, terminating the program, or migrating the job. TreadMarks is a state-of-the-art software DSM implementation developed on standard UNIX systems. However, it does not provide fault detection support because it uses the simple, connectionless UDP protocol. In this project, the UDP protocol used by TreadMarks was replaced with TCP, which is connection-oriented, reliable, and byte-stream based. A fault detection mechanism using the error codes returned by the TCP socket layer was implemented. Replacing UDP with TCP can reduce the complexity of the TreadMarks implementation, because the redundant reliability services previously used in TreadMarks can be removed. By exploiting error codes for fault detection, we avoid any potential communication overhead, because those error codes are generated automatically by TCP. Our experiments show that the impact of changing UDP to TCP on system performance depends on the scale of the application and on the synchronization technique it adopts. Shorter-running applications and applications that rely on locks for synchronization suffer more than longer-running ones and those that rely on barriers. Our experiments also show that fault detection using the error codes returned by TCP does not incur any overhead.
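The idea of detecting a failed peer from TCP socket-layer error codes can be illustrated with a minimal sketch. This is not the TreadMarks code (which is not shown here); the function name `recv_or_fault` and the set `FAULT_ERRNOS` are hypothetical, and the sketch assumes blocking sockets with no timeout set.

```python
import errno
import socket

# Assumed, illustrative set of errno values treated as "the peer is gone".
FAULT_ERRNOS = {
    errno.ECONNRESET, errno.EPIPE, errno.ETIMEDOUT,
    errno.ECONNABORTED, errno.EHOSTUNREACH, errno.ENETUNREACH,
}

def recv_or_fault(sock, nbytes):
    """Read from a connected TCP socket.

    Returns (data, fault): fault is None on success, otherwise a short
    reason string such as "EOF" or "ECONNRESET".
    """
    try:
        data = sock.recv(nbytes)
    except OSError as exc:
        if exc.errno in FAULT_ERRNOS:
            # The socket layer reported the failure; no extra messages were sent.
            return b"", errno.errorcode.get(exc.errno, "UNKNOWN")
        raise  # unrelated error: let the caller see it
    if data == b"":
        # An empty read on a stream socket means the peer closed the connection.
        return b"", "EOF"
    return data, None
```

A caller would check the second return value after each read and, on a fault, run its own clean-up, termination, or migration logic. Note that when a whole machine crashes rather than a single process, TCP may report the error only after a later send times out or a keepalive probe fails, so detection is not necessarily immediate.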
Similar resources
Distributed Cactus Stacks: Runtime Stack-Sharing Support for Distributed Parallel Programs
Parallel programming systems based on the Distributed Shared Memory technique have been promoted as easy to program, natural, and equivalent to multiprocessor systems. However, most programmers find this is not the case. The shared memory in DSM systems does not have the same access and sharing semantics as shared memory in real multiprocessor systems (shared memory multiprocessors). We present a s...
Using Peer Support to Reduce Fault-Tolerant Overhead in Distributed Shared Memories
We present a peer logging system for reducing performance overhead in fault-tolerant distributed shared memory systems. Our system provides fault-tolerant shared memory using individual checkpointing and rollback. Peer logging logs DSM modification messages to remote nodes instead of to local disks. We present results for implementations of our fault-tolerant technique using simulations of both...
A Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Current shared-memory multiprocessors suffer from an inherent fragility, since a single hardware or system software failure can cause the entire machine to crash. This dissertation describes a combination of hardware and software techniques that can be used to provide fault containment for large-scale shared memory machines. With fault containment, the impact of a fault remains limited to only ...
Communication infrastructure for IFATIS distributed embedded control application
In the paper, a communication infrastructure for a fault-tolerant embedded control system is discussed. It was developed as part of the experimental platform of the IFATIS project within the 5th European Union framework programme, dealing with dependable, reconfigurable integrated control systems. In such environments, a reliable and robust physical interconnection is required. The communication is ...
Replication for Efficiency and Fault Tolerance in a DSM System
Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechani...
Publication date: 2002